Yelp Reviews: Authorship Attribution with Python and scikit-learn

When people write text, they do so in their own specific style. Often, it's possible to identify someone using only their unique style of writing. In this post, we'll see how easy it is to identify people using their writing style through machine learning. Specifically, we'll look at reviewers who have left multiple reviews on Yelp. We'll teach a machine learning system to differentiate between different writing styles, and then see how well it can predict the correct author of a review, looking only at the review text.

Overview

In this post, we will:

  • Load over 4 million reviews from the 2017 Yelp Dataset Challenge
  • Find users who have left at least 500 reviews
  • Train a support vector machine classifier to identify the writing style of each author
  • See how well the classifier can identify reviews it hasn't seen during training

We'll use Python and Jupyter Notebook to develop our system, relying on scikit-learn for the machine learning components. Jupyter lets us run sections of code as we progress, view the output of each 'cell' of code, and make changes where necessary without re-running the entire script.

Prerequisites

To follow along with this post, you should have used Python before. It'll help if you know some basics of machine learning, but we'll explain everything we do in detail, so you should be able to keep up even if you're just getting started. You'll need a working Python environment with Jupyter Notebook and scikit-learn installed. You can install both by running:

pip3 install jupyter scikit-learn

You'll also need to download the Yelp dataset from https://www.yelp.com/dataset_challenge. The data is a 1.7 GB tarball, which decompresses to over 4 GB of JSON files, so you'll need some free disk space. Once you've downloaded and decompressed the data, start up a Jupyter notebook in the directory where you saved the dataset, and create a new Python 3 notebook. If you prefer, you can simply download this tutorial from <> and open it with Jupyter; you can then run all the code while reading the explanatory text, making any changes to the code that you want.
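
If you'd rather decompress the archive from Python than from the command line, here's a minimal sketch -- the file name below is a guess, so match it to whatever your download is actually called:


In [ ]:
# Decompress the dataset tarball (the file name is a guess -- adjust it to match your download)
import tarfile
with tarfile.open("yelp_dataset_challenge_round9.tgz") as tar:
    tar.extractall()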

Reading in the review data

First, we'll load the Yelp reviews from disk into a Python list. We put %%time at the beginning of the cell to tell Jupyter to report how long the code took to run. On my machine, parsing the 4 million reviews took about a minute and a half.


In [162]:
%%time

import json

# The dataset stores one JSON-encoded review per line
reviews = []
with open("yelp_academic_dataset_review.json") as f:
    for line in f:
        reviews.append(json.loads(line))


CPU times: user 52.2 s, sys: 28.7 s, total: 1min 20s
Wall time: 1min 34s

We'll take a look at the first review so we know how the data is structured. For this post, we'll only be using the user_id and text fields (the user IDs have been anonymized, but if two reviews have the same user_id, it means that they were written by the same user).


In [166]:
print(reviews[0])


{'review_id': 'NxL8SIC5yqOdnlXCg18IBg', 'user_id': 'KpkOkG6RIf4Ra25Lhhxf1A', 'business_id': '2aFiy99vNLklCx3T_tGS9A', 'stars': 5, 'date': '2011-10-10', 'text': "If you enjoy service by someone who is as competent as he is personable, I would recommend Corey Kaplan highly. The time he has spent here has been very productive and working with him educational and enjoyable. I hope not to need him again (though this is highly unlikely) but knowing he is there if I do is very nice. By the way, I'm not from El Centro, CA. but Scottsdale, AZ.", 'useful': 0, 'funny': 0, 'cool': 0, 'type': 'review'}

Getting the top reviewers

We're only interested in users who have left multiple reviews for our analysis. Because most users have left only one or two reviews (see http://www.developintelligence.com/blog/2017/02/analyzing-4-million-yelp-reviews-python-aws-ec2-instance/ for a more detailed breakdown), we'll be excluding a large portion of the reviews. We can efficiently find the reviewers who have left the most reviews by using a Python Counter. We'll take the top 80 reviewers, who have all left at least 500 reviews in this dataset.


In [163]:
from collections import Counter
prolific_reviewers = Counter([review['user_id'] for review in reviews]).most_common(80)

Let's take a look at the top 5 reviewers.


In [168]:
prolific_reviewers[:5]


Out[168]:
[('CxDOIDnH8gp9KXzpBHJYXw', 3327),
 ('bLbSNkLggFnqwNNzzq-Ijw', 1795),
 ('PKEzKWv_FktMm2mGPjwd0Q', 1509),
 ('QJI9OSEn6ujRCtrX06vs1w', 1316),
 ('DK57YibC5ShBmqQl97CKog', 1266)]

One person has left over 3000 reviews on Yelp! The drop-off is quite steep, but the 80th person in the list has still left over 500 reviews.
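
We can verify this with a quick sanity check on the last entry in the prolific_reviewers list we just built:


In [ ]:
# Sanity check: the 80th reviewer should still have at least 500 reviews
print(prolific_reviewers[-1])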

Now we want to create a balanced dataset -- i.e., we want the same number of reviews from each reviewer. We'll go through all our reviews again and keep only those written by the 80 authors we identified above, and at most 500 reviews from each author. Below, keep_ids is a dictionary we'll use to count how many reviews we've kept from each author.


In [173]:
keep_ids = {pr[0] : 0 for pr in prolific_reviewers}

In [174]:
keep_reviews = []
# Keep at most 500 reviews from each of the 80 most prolific authors
for review in reviews:
    uid = review['user_id']
    if uid in keep_ids and keep_ids[uid] < 500:
        keep_reviews.append(review)
        keep_ids[uid] += 1

Preparing the training data

Now we'll split the reviews we kept into two lists: one for the texts of the reviews, and another for the author ids. The two lists are implicitly associated by index (i.e., the first text in our texts array was written by the first author in our authors array). In machine learning, we refer to these as "instances" and "labels". Instances are the things we use to learn from, and the labels are the things we are trying to learn.


In [175]:
texts = [review['text'] for review in keep_reviews]
authors = [review['user_id'] for review in keep_reviews]

Next, we need to import some things from the scikit-learn library. Specifically, we need a vectorizer (something that transforms our texts into a numerical representation that's easier to work with) and a classifier (the thing that learns how to discriminate based on labeled examples). We'll be using TfidfVectorizer, which transforms our texts into vectors with tf-idf weighting, and LinearSVC, a Support Vector Machine with a linear kernel -- a kernel that is often used for text classification tasks. We'll also import a helper function called train_test_split, which we'll use to split our data into a training set and a test set. The classifier will learn patterns from the training set, and then we'll make sure that it actually works by seeing if it can correctly predict the authors in the held-out test set.


In [210]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
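
Before we vectorize the full dataset, here's a toy illustration of what tf-idf weighting does, using three made-up sentences (not part of the Yelp data). Words that appear in every document, like 'the' and 'was', receive lower weights than the words that distinguish one document from another:


In [ ]:
# Toy illustration of tf-idf weighting (made-up sentences, not Yelp data)
toy_docs = ["the food was great", "the service was slow", "the decor was nice"]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
# The vocabulary is sorted alphabetically, matching the column order of the matrix
print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.toarray().round(2))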

Now we can transform our texts into vectors by setting up a vectorizer and giving it the list of our texts.


In [211]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
print(vectors.shape)


(40000, 68619)

Our vectors variable contains a sparse matrix with shape 40000 (the number of texts we kept -- 500 texts times 80 authors) by 68619 (the number of features). Our features are single words (uni-grams), and each text is represented by the tf-idf weight of each word that appears in it. Most entries are zero, since any single review uses only a tiny fraction of the 68619 unique words in the dataset.
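
To see this sparsity concretely, we can count the non-zero entries in the first review's vector (scipy sparse matrices expose this count as .nnz):


In [ ]:
# Count the non-zero entries in the first review's vector
first_vector = vectors[0]
print(first_vector.nnz, "non-zero entries out of", vectors.shape[1], "features")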

Often in machine learning, you'll see the instances (texts in our case) referred to as Xs and the labels as ys. You can think of a machine learning task as a function y = f(x): we have x (the review text) and we want to know y (the author's ID), and the SVM attempts to learn a function f that maps the texts to the labels. We'll follow this convention and break our texts into X_train (the texts we'll show the SVM as learning examples) and X_test (the texts we'll hold back, so we can see whether the SVM can predict the correct authors based on the patterns it learned from X_train). We'll similarly break our labels (the author IDs) into two arrays: y_train and y_test. scikit-learn provides a function that takes a random sample of our texts and labels while keeping the indices aligned, as follows.


In [212]:
# We use a fixed random_state to ensure that the same random sampling is used every time the code is run.
X_train, X_test, y_train, y_test = train_test_split(vectors, authors, test_size=0.2, random_state=1337)

We now have 32000 texts (80% of our data) to train on and 8000 (20%) for testing:


In [213]:
print(X_train.shape, X_test.shape)


(32000, 68619) (8000, 68619)

Training and testing a classifier

We first need to call fit on our classifier, passing in the training texts and labels. Then we'll call predict to get predictions on the test data, and look at some metrics to see how well it did.


In [214]:
%%time

svm = LinearSVC()
svm.fit(X_train, y_train)


CPU times: user 16.7 s, sys: 119 ms, total: 16.8 s
Wall time: 16.8 s

We can now make predictions on the test set (note that the SVM has never seen the test-set labels stored in y_test). For each review we pass in, the SVM will output the user_id it thinks is most likely to be the author.


In [215]:
predictions = svm.predict(X_test)
print(list(predictions[0:10]))


['cMEtAiW60I5wE_vLfTxoJQ', 'EiP1OFgs-XGcKZux0OKWIA', 'V-BbqKqO8anwplGRx9Q5aQ', 'FIk4lQQu1eTe2EpzQ4xhBA', 'J3ucveGKKJDvtuCNnb_x0g', 'FIk4lQQu1eTe2EpzQ4xhBA', '1kNsEAhGU8d8xugMuXJGFA', 'j6wLUT0ZXi-x0otelYIFpA', 'DK57YibC5ShBmqQl97CKog', 'sYQyXDjGaJj7wfaqz5u8KQ']

If the classifier did a good job, the predictions should match the test labels (y_test). We can compare the first 10 manually to see how this works:


In [216]:
print(y_test[:10])


['yT_QCcnq-QGipWWuzIpvtw', 'EiP1OFgs-XGcKZux0OKWIA', 'V-BbqKqO8anwplGRx9Q5aQ', 'FIk4lQQu1eTe2EpzQ4xhBA', 'J3ucveGKKJDvtuCNnb_x0g', 'FIk4lQQu1eTe2EpzQ4xhBA', '1kNsEAhGU8d8xugMuXJGFA', 'j6wLUT0ZXi-x0otelYIFpA', 'DK57YibC5ShBmqQl97CKog', 'sYQyXDjGaJj7wfaqz5u8KQ']

It got all but the first one correct! We can use some more helper functions from scikit-learn to get a summarised view of the results for all 8000 predictions. Accuracy is simply the number of correct predictions divided by the total number of predictions:


In [217]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))


0.86475

We have 80 authors in our dataset, so if we were simply guessing at random we'd expect an accuracy score of 0.0125 (1/80). Instead, our SVM can correctly predict the author of a review more than 86% of the time.
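
Accuracy collapses the results into a single number. For a per-author breakdown, scikit-learn's classification_report prints precision and recall for each user_id (one row per author, so expect 80 rows of output):


In [ ]:
# Per-author precision and recall -- prints one row per user_id
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))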

Improving our system

Let's see if we can tune some parameters and do even better.

The following vectorizer looks at words individually (uni-grams) but also at pairs of words (bi-grams). This makes our feature space much larger, so we'll need a bit more processing time, both to create the vectors and to train the SVM on them. To keep this manageable, we'll tell the vectorizer to ignore all words and word-pairs that don't appear in at least five different reviews -- there are a lot of very rarely used words, and we can't learn anything from them anyway.


In [224]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=5)
vectors = vectorizer.fit_transform(texts)
print(vectors.shape)


(40000, 174396)

Now we have 174396 features, about two and a half times as many as before. We can create our train-test split again, and run the fitting and prediction code as before.


In [225]:
X_train, X_test, y_train, y_test = train_test_split(vectors, authors, test_size=0.2, random_state=1337)

In [226]:
svm = LinearSVC()
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)
print(accuracy_score(y_test, predictions))


0.90525

The new vectorization parameters are a bit better: we can now identify the correct author more than 90% of the time. Considering that some of the reviews are only a few sentences long, it is perhaps surprising that writing styles are distinctive enough to make this possible.
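
If you're curious about what the classifier picked up on, one rough way to inspect a linear SVM is to look at the highest-weighted features for a single author. This is only a sketch -- it assumes the vectorizer and svm from the cells above are still in scope, and note that get_feature_names() was renamed to get_feature_names_out() in newer versions of scikit-learn:


In [ ]:
# Rough sketch: the ten highest-weighted features for the most prolific reviewer
import numpy as np
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn
author_index = list(svm.classes_).index(prolific_reviewers[0][0])
top_indices = np.argsort(svm.coef_[author_index])[-10:]
print([feature_names[i] for i in top_indices])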

Conclusion

In this post, you learnt how to use Python and scikit-learn for authorship identification. You can now prepare text data and train a simple classifier. You know a bit about the vectorization process, and the choices that have to be made regarding data preparation.

It might seem pointless to predict labels that we already have, but the same principle has many practical applications. For example, it could help find people who operate more than one Yelp account to promote their own establishment or leave bad reviews for competitors. It is also useful in forensic linguistics, where the authorship of documents such as wills or suicide notes is sometimes disputed, and it can be used to investigate the authorship of disputed literary works, such as Shakespeare's plays or books written under pseudonyms.

